{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Relational Databases and SQL\n",
"
\n",
"\n",
"In the previous tutorial, we worked with data stored in CSV files. However, CSV files are inconvenient in many real-world scenarios. Data scientists commonly work on a team to analyze a shared dataset. For instance, an financial analyst group might receive new data on an minute basis. Instead of downloading a new CSV file every minute, data scientists prefer to use shared data storage that always reflects the most up-to-date data. \n",
"\n",
"**Database systems** are software systems specifically designed for large-scale data storage and retrieval. Industry, academic research, and governments all rely on database systems. One common and useful type of database system is an relational database management system (RDBMS). These systems allow data scientists to use a query language called SQL to quickly retrieve and process large amounts of data at once. In this chapter, we introduce the relational database model and SQL."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "11g0cRhWDgdW"
},
"source": [
"## The Relational Model \n",
"\n",
"A **database** is an organized collection of data. In the past, data was stored in specialized data structures that were designed for specific tasks. For example, airlines might record flight bookings in a different format than a bank managing an account ledger. In 1969, Ted Codd introduced the relational model as a general method of storing data. Data is stored in two-dimensional tables called **relations**, consisting of individual observations in each row (commonly referred to as **tuples**). Each tuple is a structured data item that represents the relationship between certain **attributes** (columns). Each attribute of a relation has a name and data type.\n",
"\n",
"Consider the `purchases` relation below:"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lUWBvrUoDgdZ"
},
"source": [
"\n",
"\n",
" \n",
" \n",
" name | \n",
" product | \n",
" retailer | \n",
" date purchased | \n",
"
\n",
" \n",
" \n",
" Samantha | \n",
" iPod | \n",
" Best Buy | \n",
" June 3, 2016 | \n",
"
\n",
" \n",
" Timothy | \n",
" Chromebook | \n",
" Amazon | \n",
" July 8, 2016 | \n",
"
\n",
" \n",
" Jason | \n",
" Surface Pro | \n",
" Target | \n",
" October 2, 2016 | \n",
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4c6jQUngDgdZ"
},
"source": [
"In `purchases`, each tuple represents the relationship between the `name`, `product`, `retailer`, and `date purchased` attributes. \n",
"\n",
"A relation's *schema* contains its column names, data types, and constraints. For example, the schema of the `purchases` table states that the columns are `name`, `product`, `retailer`, and `date purchased`; it also states that each column contains text.\n",
"\n",
"The following `prices` relation shows the price of certain gadgets at a few retail stores:"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3pm7rk5xDgdb"
},
"source": [
"\n",
"\n",
" \n",
" \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" Best Buy | \n",
" Galaxy S9 | \n",
" 719.00 | \n",
"
\n",
" \n",
" Best Buy | \n",
" iPod | \n",
" 200.00 | \n",
"
\n",
" \n",
" Amazon | \n",
" iPad | \n",
" 450.00 | \n",
"
\n",
" \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
" Target | \n",
" iPod | \n",
" 215.00 | \n",
"
\n",
" \n",
" Target | \n",
" Surface Pro | \n",
" 799.00 | \n",
"
\n",
" \n",
" Target | \n",
" Google Pixel 2 | \n",
" 659.00 | \n",
"
\n",
" \n",
" Walmart | \n",
" Chromebook | \n",
" 238.79 | \n",
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-WMLV3OiDgdc"
},
"source": [
"We can then reference both tables simultaneously to determine how much Samantha, Timothy, and Jason paid for their respective gadgets (assuming prices at each store stay constant over time). Together, the two tables form a **relational database**, which is a collection of one or more relations.\n",
"The schema of the entire database is the set of schemas of the individual relations in the database."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Relational Database Management Systems\n",
"\n",
"A relational database can be simply described as a set of tables containing rows of individual data entries. A relational database management system (RDBMSs) provides an interface to a relational database. [Oracle](https://www.wikiwand.com/en/Oracle_Database), [MySQL](https://www.wikiwand.com/en/MySQL), and [PostgreSQL](https://www.wikiwand.com/en/PostgreSQL) are three of the most commonly used RDBMSs used in practice today.\n",
"\n",
"Relational database management systems give users the ability to add, edit, and remove data from databases. These systems provide several key benefits over using a collection of text files to store data, including:\n",
"\n",
"1. Reliable data storage: RDBMSs protect against data corruption from system failures or crashes.\n",
"1. Performance: RDBMSs often store data more efficiently than text files and have well-developed algorithms for querying data.\n",
"1. Data management: RDBMSs implement access control, preventing unauthorized users from accessing sensitive datasets.\n",
"1. Data consistency: RDBMSs can impose constraints on the data entered—for example, that a column `GPA` only contains floats between 0.0 and 4.0.\n",
"\n",
"To work with data stored in a RDBMS, we use the SQL programming language."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RDBMS vs. pandas\n",
"\n",
"How do RDBMSs and the `pandas` Python package differ? First, `pandas` is not concerned about data storage. Although DataFrames can read and write from multiple data formats, `pandas` does not dictate how the data are actually stored on the underlying computer like a RDBMS does. Second, `pandas` primarily provides methods for manipulating data while RDBMSs handle both data storage and data manipulation, making them more suitable for larger datasets. A typical rule of thumb is to use a RDBMS for datasets larger than several gigabytes. Finally, `pandas` requires knowledge of Python in order to use, whereas RDBMSs require knowledge of SQL."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lvsOEBDjDgdd"
},
"source": [
"## SQL\n",
"\n",
"**SQL** (Structured Query Language) is a programming language that has operations to define, logically organize, manipulate, and perform calculations on data stored in a relational database management system (RDBMS).\n",
"\n",
"SQL is a declarative language. This means that the user only needs to specify *what* kind of data they want, not *how* to obtain it. An example is shown below, with an imperative example for comparison:\n",
"\n",
"- **Declarative**: Compute the table with columns \"x\" and \"y\" from table \"A\" where the values in \"y\" are greater than 100.00.\n",
"- **Imperative**: For each record in table \"A\", check if the record contains a value of \"y\" greater than 100. If so, then store the record's \"x\" and \"y\" attributes in a new table. Return the new table.\n",
"\n",
"In this tutorial, we will write SQL queries as Python strings, then use `pandas` to execute the SQL query and read the result into a `pandas` DataFrame. As we walk through the basics of SQL syntax, we'll also occasionally show `pandas` equivalents for comparison purposes."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QgEUYO2IDgdd"
},
"source": [
"### Executing SQL Queries through `pandas`\n",
"
\n",
"\n",
"To execute SQL queries from Python, we will connect to a database using the [sqlalchemy](http://docs.sqlalchemy.org/en/latest/core/tutorial.html) library. Then we can use the `pandas` function [pd.read_sql](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) to execute SQL queries through this connection."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 368
},
"colab_type": "code",
"executionInfo": {
"elapsed": 2288,
"status": "error",
"timestamp": 1523469699841,
"user": {
"displayName": "Ananth Agarwal",
"photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
"userId": "114213530163291820964"
},
"user_tz": 420
},
"id": "IS_NumI0Dgde",
"outputId": "c9e6faab-5640-43d0-a321-a23707868cdd"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sqlalchemy\n",
"\n",
"sqlite_uri = \"sqlite:///sql_basics.db\"\n",
"sqlite_engine = sqlalchemy.create_engine(sqlite_uri)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This database contains one relation: `prices`. To display the relation we run a SQL query. Calling `read_sql` will execute the SQL query on the RDBMS, then return the results in a `pandas` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Best Buy | \n",
" Galaxy S9 | \n",
" 719.00 | \n",
"
\n",
" \n",
" 1 | \n",
" Best Buy | \n",
" iPod | \n",
" 200.00 | \n",
"
\n",
" \n",
" 2 | \n",
" Amazon | \n",
" iPad | \n",
" 450.00 | \n",
"
\n",
" \n",
" 3 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 4 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"0 Best Buy Galaxy S9 719.00\n",
"1 Best Buy iPod 200.00\n",
"2 Amazon iPad 450.00\n",
"3 Amazon Battery pack 24.87\n",
"4 Amazon Chromebook 249.99"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pd.read_sql takes in a parameter for a SQLite engine, which we create below\n",
"\n",
"sql_expr = \"\"\"\n",
"SELECT * \n",
"FROM Prices\n",
"\"\"\"\n",
"prices = pd.read_sql(sql_expr, sqlite_engine)\n",
"prices.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "jWfTN1elDgd1"
},
"source": [
"### SQL Syntax\n",
"
\n",
"\n",
"All SQL queries take the general form below:\n",
"```SQL\n",
"SELECT [DISTINCT] \n",
"FROM \n",
"[WHERE ]\n",
"[GROUP BY ]\n",
"[HAVING ]\n",
"[ORDER BY ]\n",
"[LIMIT ]\n",
"```\n",
"\n",
"```{note}\n",
"\n",
"1. **Everything in \\[square brackets\\] is optional.** A valid SQL query only needs a `SELECT` and a `FROM` statement.\n",
"2. **SQL SYNTAX IS GENERALLY WRITTEN IN CAPITAL LETTERS.** Although capitalization isn't required, it is common practice to write SQL syntax in capital letters. It also helps to visually structure your query for others to read.\n",
"3. `FROM` query blocks can reference one or more tables, although in this section we will only look at one table at a time for simplicity.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "l-L97SzrDgd2"
},
"source": [
"## SELECT and FROM\n",
"\n",
"The two mandatory statements in a SQL query are:\n",
"\n",
"* `SELECT` indicates the columns that we want to view.\n",
"* `FROM` indicates the tables from which we are selecting these columns.\n",
"\n",
"To display the entire `prices` table, we run:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "vt6rh2TyDgd3",
"outputId": "83c4e317-240b-4f7f-9acf-e5b8f5cae665"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Best Buy | \n",
" Galaxy S9 | \n",
" 719.00 | \n",
"
\n",
" \n",
" 1 | \n",
" Best Buy | \n",
" iPod | \n",
" 200.00 | \n",
"
\n",
" \n",
" 2 | \n",
" Amazon | \n",
" iPad | \n",
" 450.00 | \n",
"
\n",
" \n",
" 3 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 4 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"0 Best Buy Galaxy S9 719.00\n",
"1 Best Buy iPod 200.00\n",
"2 Amazon iPad 450.00\n",
"3 Amazon Battery pack 24.87\n",
"4 Amazon Chromebook 249.99"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT * \n",
"FROM prices\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine).head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ebYQfMJTDgd6"
},
"source": [
"`SELECT *` returns every column in the original relation. To display only the retailers that are represented in `prices`, we add the `retailer` column to the `SELECT` statement."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 198
},
"colab_type": "code",
"executionInfo": {
"elapsed": 270,
"status": "error",
"timestamp": 1523470381414,
"user": {
"displayName": "Ananth Agarwal",
"photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
"userId": "114213530163291820964"
},
"user_tz": 420
},
"id": "xJZueetYDgd7",
"outputId": "0b27f335-3c7c-4102-d977-c3f6531800ac"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Best Buy | \n",
"
\n",
" \n",
" 1 | \n",
" Best Buy | \n",
"
\n",
" \n",
" 2 | \n",
" Amazon | \n",
"
\n",
" \n",
" 3 | \n",
" Amazon | \n",
"
\n",
" \n",
" 4 | \n",
" Amazon | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer\n",
"0 Best Buy\n",
"1 Best Buy\n",
"2 Amazon\n",
"3 Amazon\n",
"4 Amazon"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT retailer\n",
"FROM prices\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine).head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2jYzN16_Dgd-"
},
"source": [
"If we want a list of unique retailers, we can call the `DISTINCT` function to omit repeated values."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "tkuqF63gDgd-",
"outputId": "93e6cd48-dba5-4218-9d79-22f2ccf332fa"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Best Buy | \n",
"
\n",
" \n",
" 1 | \n",
" Amazon | \n",
"
\n",
" \n",
" 2 | \n",
" Target | \n",
"
\n",
" \n",
" 3 | \n",
" Walmart | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer\n",
"0 Best Buy\n",
"1 Amazon\n",
"2 Target\n",
"3 Walmart"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT DISTINCT(retailer)\n",
"FROM prices\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine).head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OQ7jpW05DgeE"
},
"source": [
"This would be the functional equivalent of the following `pandas` code:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "64uSDqguDgeG",
"outputId": "c91af9a1-41a0-4188-abb6-1099d9e0237b"
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Best Buy', 'Amazon', 'Target', 'Walmart'], dtype=object)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prices['retailer'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-Z8DWBHF84pT"
},
"source": [
"Each RDBMS comes with its own set of functions that can be applied to attributes in the `SELECT` list, such as comparison operators, mathematical functions and operators, and string functions and operators. Here we use PostgreSQL, a mature RDBMS that comes with hundreds of such functions, the complete list is available [here](https://www.postgresql.org/docs/9.2/static/functions.html). Keep in mind that each RDBMS has a different set of functions for use in `SELECT`.\n",
"\n",
"The following code converts all retailer names to uppercase and halves the product prices."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "9jDk28vp9bRs"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer_caps | \n",
" product | \n",
" half_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" BEST BUY | \n",
" Galaxy S9 | \n",
" 359.500 | \n",
"
\n",
" \n",
" 1 | \n",
" BEST BUY | \n",
" iPod | \n",
" 100.000 | \n",
"
\n",
" \n",
" 2 | \n",
" AMAZON | \n",
" iPad | \n",
" 225.000 | \n",
"
\n",
" \n",
" 3 | \n",
" AMAZON | \n",
" Battery pack | \n",
" 12.435 | \n",
"
\n",
" \n",
" 4 | \n",
" AMAZON | \n",
" Chromebook | \n",
" 124.995 | \n",
"
\n",
" \n",
" 5 | \n",
" TARGET | \n",
" iPod | \n",
" 107.500 | \n",
"
\n",
" \n",
" 6 | \n",
" TARGET | \n",
" Surface Pro | \n",
" 399.500 | \n",
"
\n",
" \n",
" 7 | \n",
" TARGET | \n",
" Google Pixel 2 | \n",
" 329.500 | \n",
"
\n",
" \n",
" 8 | \n",
" WALMART | \n",
" Chromebook | \n",
" 119.395 | \n",
"
\n",
" \n",
" 9 | \n",
" BEST BUY | \n",
" Galaxy S9 | \n",
" 359.500 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer_caps product half_price\n",
"0 BEST BUY Galaxy S9 359.500\n",
"1 BEST BUY iPod 100.000\n",
"2 AMAZON iPad 225.000\n",
"3 AMAZON Battery pack 12.435\n",
"4 AMAZON Chromebook 124.995\n",
"5 TARGET iPod 107.500\n",
"6 TARGET Surface Pro 399.500\n",
"7 TARGET Google Pixel 2 329.500\n",
"8 WALMART Chromebook 119.395\n",
"9 BEST BUY Galaxy S9 359.500"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT\n",
" UPPER(retailer) AS retailer_caps,\n",
" product,\n",
" price / 2 AS half_price\n",
"FROM prices\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "VtdR6gZc-oP7"
},
"source": [
"```{note}\n",
"Notice that we can **alias** the columns (assign another name) with `AS` so that the columns appear with this new name in the output table. This does not modify the names of the columns in the source relation.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZJlclDebDgeR"
},
"source": [
"### WHERE\n",
"\n",
"The `WHERE` clause allows us to specify certain constraints for the returned data; these constraints are often referred to as **predicates**. For example, to retrieve only gadgets that are under $500:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 198
},
"colab_type": "code",
"executionInfo": {
"elapsed": 284,
"status": "error",
"timestamp": 1523471405965,
"user": {
"displayName": "Ananth Agarwal",
"photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
"userId": "114213530163291820964"
},
"user_tz": 420
},
"id": "k5PYF5qvDgeR",
"outputId": "157bb2d9-bcbc-432d-a00a-6f0b8bb2df0c"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Best Buy | \n",
" iPod | \n",
" 200.00 | \n",
"
\n",
" \n",
" 1 | \n",
" Amazon | \n",
" iPad | \n",
" 450.00 | \n",
"
\n",
" \n",
" 2 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 3 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
" 4 | \n",
" Target | \n",
" iPod | \n",
" 215.00 | \n",
"
\n",
" \n",
" 5 | \n",
" Walmart | \n",
" Chromebook | \n",
" 238.79 | \n",
"
\n",
" \n",
" 6 | \n",
" Best Buy | \n",
" iPod | \n",
" 200.00 | \n",
"
\n",
" \n",
" 7 | \n",
" Amazon | \n",
" iPad | \n",
" 450.00 | \n",
"
\n",
" \n",
" 8 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 9 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"0 Best Buy iPod 200.00\n",
"1 Amazon iPad 450.00\n",
"2 Amazon Battery pack 24.87\n",
"3 Amazon Chromebook 249.99\n",
"4 Target iPod 215.00\n",
"5 Walmart Chromebook 238.79\n",
"6 Best Buy iPod 200.00\n",
"7 Amazon iPad 450.00\n",
"8 Amazon Battery pack 24.87\n",
"9 Amazon Chromebook 249.99"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT *\n",
"FROM prices\n",
"WHERE price < 500\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "yFCi_LlpDgeV"
},
"source": [
"We can also use the operators `AND`, `OR`, and `NOT` to further constrain our SQL query. To find an item on Amazon without a battery pack under $300, we write:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "9TLEQNhRDgeV",
"outputId": "b5dfaad1-4bed-492c-9ea2-2614b0066f1c"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
" 1 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
" 2 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"0 Amazon Chromebook 249.99\n",
"1 Amazon Chromebook 249.99\n",
"2 Amazon Chromebook 249.99"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT *\n",
"FROM prices\n",
"WHERE retailer = 'Amazon'\n",
" AND NOT product = 'Battery pack'\n",
" AND price < 300\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6waniOVaDgeY"
},
"source": [
"The equivalent operation in `pandas` is:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
" 13 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
" 22 | \n",
" Amazon | \n",
" Chromebook | \n",
" 249.99 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"4 Amazon Chromebook 249.99\n",
"13 Amazon Chromebook 249.99\n",
"22 Amazon Chromebook 249.99"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prices[(prices['retailer'] == 'Amazon') \n",
" & ~(prices['product'] == 'Battery pack')\n",
" & (prices['price'] <= 300)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lnCDsNMxDgec"
},
"source": [
"```{note}\n",
"There's a subtle difference that's worth noting: the index of the Chromebook in the SQL query is 0, whereas the corresponding index in the DataFrame is 4. This is because SQL queries always return a new table with indices counting up from 0, whereas `pandas` subsets a portion of the DataFrame `prices` and returns it with the original indices. We can use [pd.DataFrame.reset_index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) to reset the indices in `pandas`.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "m19xAQ6t0Waa"
},
"source": [
"### Aggregate Functions\n",
"\n",
"So far, we've only worked with data from the existing rows in the table; that is, all of our returned tables have been some subset of the entries found in the table. But to conduct data analysis, we'll want to compute aggregate values over our data. In SQL, these are called **aggregate functions**. \n",
"\n",
"If we want to find the average price of all gadgets in the `prices` relation:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "J3N1MxtKDgee",
"outputId": "2cfb1b75-fdf4-46e1-d786-174ef6a21c5e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" avg_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 395.072222 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" avg_price\n",
"0 395.072222"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT AVG(price) AS avg_price\n",
"FROM prices\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ind0BYFBDgeh"
},
"source": [
"Equivalently, in `pandas`:\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "5GlmVQ0uDgei",
"outputId": "a6bdf1b4-cd3c-49e2-9bc0-3bef03bd89ea"
},
"outputs": [
{
"data": {
"text/plain": [
"395.07222222222225"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prices['price'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "5oTbANu_5p59"
},
"source": [
"A complete list of PostgreSQL aggregate functions can be found [here](https://www.postgresql.org/docs/9.2/static/functions.html). Though we're using PostgreSQL as our primary version of SQL, keep in mind that there are many other variations of SQL (MySQL, SQLite, etc.) that use different function names and have different functions available."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7eVQXWJ8Dgek"
},
"source": [
"### GROUP BY and HAVING\n",
"\n",
"With aggregate functions, we can execute more complicated SQL queries. To operate on more granular aggregate data, we can use the following two clauses:\n",
"- `GROUP BY` takes a list of columns and groups the table like the [pd.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) function in `pandas`.\n",
"- `HAVING` is functionally similar to `WHERE`, but is used exclusively to apply conditions to aggregated data. (Note that in order to use `HAVING`, it must be preceded by a `GROUP BY` clause.)\n",
"\n",
"**Important**: When using `GROUP BY`, all columns in the `SELECT` clause must be either listed in the `GROUP BY` clause or have an aggregate function applied to them.\n",
"\n",
"We can use these statements to find the maximum price at each retailer."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "w7u9ixrbDgel",
"outputId": "084c2d9f-7fa5-46b3-a254-53e0c7bbf60d"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" max_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Amazon | \n",
" 450.00 | \n",
"
\n",
" \n",
" 1 | \n",
" Best Buy | \n",
" 719.00 | \n",
"
\n",
" \n",
" 2 | \n",
" Target | \n",
" 799.00 | \n",
"
\n",
" \n",
" 3 | \n",
" Walmart | \n",
" 238.79 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer max_price\n",
"0 Amazon 450.00\n",
"1 Best Buy 719.00\n",
"2 Target 799.00\n",
"3 Walmart 238.79"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT retailer, MAX(price) as max_price\n",
"FROM prices\n",
"GROUP BY retailer\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ypDYw58WDgen"
},
"source": [
"Let's say we have a client with expensive taste and only want to find retailers that sell gadgets over $700. Note that we must use `HAVING` to define conditions on aggregated columns; we can't use `WHERE` to filter an aggregated column. To compute a list of retailers and accompanying prices that satisfy our needs, we run:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "HAgxJxU-Dgeo",
"outputId": "9afeecc3-28c4-484d-9afd-88d37416e81d"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" max_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Best Buy | \n",
" 719.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Target | \n",
" 799.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer max_price\n",
"0 Best Buy 719.0\n",
"1 Target 799.0"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT retailer, MAX(price) as max_price\n",
"FROM prices\n",
"GROUP BY retailer\n",
"HAVING max_price > 700\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fBv5gVKADgeq"
},
"source": [
"For comparison, we recreate the same table in `pandas`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "p1QTMs1JDgeq",
"outputId": "061e1559-0236-4e75-f0b7-ffce1c6f6e4e"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" price | \n",
"
\n",
" \n",
" retailer | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Best Buy | \n",
" 719.0 | \n",
"
\n",
" \n",
" Target | \n",
" 799.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" price\n",
"retailer \n",
"Best Buy 719.0\n",
"Target 799.0"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max_prices = prices.groupby('retailer').max()\n",
"max_prices.loc[max_prices['price'] > 700, ['price']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lQsEZZfWDgeu"
},
"source": [
"### ORDER BY and LIMIT\n",
"\n",
"These clauses allow us to control the presentation of the data:\n",
"- `ORDER BY` lets us present the data in alphabetical order of column values. By default, ORDER BY uses ascending order (`ASC`) but we can specify descending order using `DESC`.\n",
"- `LIMIT` controls how many tuples are displayed.\n",
"\n",
"Let's display the three cheapest items in our `prices` table:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "X7aFdIxZDgeu",
"outputId": "d4cd90f7-bb6f-45ec-c95c-d543e3408981"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 1 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 2 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"0 Amazon Battery pack 24.87\n",
"1 Amazon Battery pack 24.87\n",
"2 Amazon Battery pack 24.87"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT *\n",
"FROM prices\n",
"ORDER BY price ASC\n",
"LIMIT 3\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Xo31JonpDgex"
},
"source": [
"Note that we didn't have to include the `ASC` keyword since `ORDER BY` returns data in ascending order by default.\n",
"For comparison, in `pandas`:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
}
},
"colab_type": "code",
"id": "mJGnPtT7Dgex",
"outputId": "e57f4864-c559-439f-93a0-444d97094ba7"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" retailer | \n",
" product | \n",
" price | \n",
"
\n",
" \n",
" \n",
" \n",
" 3 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 21 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
" 12 | \n",
" Amazon | \n",
" Battery pack | \n",
" 24.87 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" retailer product price\n",
"3 Amazon Battery pack 24.87\n",
"21 Amazon Battery pack 24.87\n",
"12 Amazon Battery pack 24.87"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prices.sort_values('price').head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "vFISvHXTDge1"
},
"source": [
"(Again, we see that the indices are out of order in the `pandas` DataFrame. As before, `pandas` returns a view on our DataFrame `prices`, whereas SQL is displaying a new table each time that we execute a query.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ESRS1LaiAiRw"
},
"source": [
"### Conceptual SQL Evaluation\n",
"\n",
"Clauses in a SQL query are executed in a specific order. Unfortunately, this order differs from the order that the clauses are written in a SQL query. From first executed to last:\n",
"\n",
"1. `FROM`: One or more source tables\n",
"2. `WHERE`: Apply selection qualifications (eliminate rows)\n",
"3. `GROUP BY`: Form groups and aggregate\n",
"4. `HAVING`: Eliminate groups\n",
"5. `SELECT`: Select columns\n",
"\n",
"**Note on `WHERE` vs. `HAVING`**: Since the `WHERE` clause is processed before applying `GROUP BY`, the `WHERE` clause cannot make use of aggregated values. To define predicates based on aggregated values, we must use the `HAVING` clause."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2id48Pl8wagw"
},
"source": [
"## SQL Joins"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "WOSG1ptQwagx"
},
"source": [
"In `pandas` we used the `pd.merge` method to join two tables using matching values in their columns. For example:\n",
"\n",
"```python\n",
"pd.merge(table1, table2, on='common_column')\n",
"```\n",
"\n",
"In this section, we introduce SQL joins. SQL joins are used to combine multiple tables in a relational database.\n",
"\n",
"Suppose we are cat store owners with a database for the cats we have in our store. We have **two** different tables: `names` and `colors`. The `names` table contains the columns `cat_id`, a unique number assigned to each cat, and `name`, the name for the cat. The `colors` table contains the columns `cat_id` and `color`, the color of each cat.\n",
"\n",
"```{note}\n",
"Note that there are some missing rows from both tables - a row with `cat_id` 3 is missing from the `names` table, and a row with `cat_id` 4 is missing from the `colors` table.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" cat_id | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Apricot | \n",
"
\n",
" \n",
" 1 | \n",
" Boots | \n",
"
\n",
" \n",
" 2 | \n",
" Cally | \n",
"
\n",
" \n",
" 4 | \n",
" Eugene | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" cat_id | \n",
" color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
" 3 | \n",
" white | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "GclJ2ZmswahF"
},
"source": [
"To compute the color of the cat named Apricot, we have to use information in both tables. We can *join* the tables on the `cat_id` column, creating a new table with both `name` and `color`."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LptZJOFGwahG"
},
"source": [
"### Joins"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7WgUyL0DwahJ"
},
"source": [
"A join combines tables by matching values in their columns.\n",
"\n",
"There are four main types of joins: inner joins, outer joins, left joins, and right joins. Although all four combine tables, each one treats non-matching values differently."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "z6qWoVSTwahJ"
},
"source": [
"### Inner Join"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "AUnNAn7ewahK"
},
"source": [
"Definition: In an inner join, the final table only contains rows that have matching columns in **both** tables.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "eQivHXNMwahL"
},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "nDUi3rOtwahM"
},
"source": [
"Example: We would like to join the `names` and `colors` tables together to match each cat with its color. Since both tables contain a `cat_id` column that is the unique identifier for a cat, we can use an inner join on the `cat_id` column.\n",
"\n",
"SQL: To write an inner join in SQL we modify our `FROM` clause to use the following syntax:\n",
"\n",
"```sql\n",
"SELECT ...\n",
"FROM \n",
" INNER JOIN \n",
" ON <...>\n",
"```\n",
"\n",
"For example:\n",
"\n",
"```sql\n",
"SELECT *\n",
"FROM names AS N\n",
" INNER JOIN colors AS C\n",
" ON N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" \n",
" \n",
" | \n",
" cat_id | \n",
" name | \n",
" cat_id | \n",
" color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" Apricot | \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Boots | \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" Cally | \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4vixSMmlzoRk"
},
"source": [
"You may verify that each cat name is matched with its color. Notice that the cats with `cat_id` 3 and 4 are not present in our resulting table because the `colors` table doesn't have a row with `cat_id` 4 and the `names` table doesn't have a row with `cat_id` 3. In an inner join, if a row doesn't have a matching value in the other table, the row is not included in the final result."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3Yx_rCbo9Id2"
},
"source": [
"Assuming we have a DataFrame called `names` and a DataFrame called `colors`, we can conduct an inner join in `pandas` by writing:\n",
"\n",
"```python\n",
"pd.merge(names, colors, how='inner', on='cat_id')\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "u24A8PiXwahV"
},
"source": [
"### Full/Outer Join"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KxucuJmRwahW"
},
"source": [
"Definition: In a full join (sometimes called an outer join), **all records from both tables** are included in the joined table. If a row doesn't have a match in the other table, the missing values are filled in with `NULL`.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "U6nh8mm3wahX"
},
"source": [
"Example: As before, we join the `names` and `colors` tables together to match each cat with its color. This time, we want to keep all rows in either table even if there isn't a match.\n",
"\n",
"SQL: To write an outer join in SQL we modify our `FROM` clause to use the following syntax:\n",
"\n",
"```sql\n",
"SELECT ...\n",
"FROM \n",
" FULL JOIN \n",
" ON <...>\n",
"```\n",
"\n",
"For example:\n",
"\n",
"```sql\n",
"SELECT name, color\n",
"FROM names N\n",
" FULL JOIN colors C\n",
" ON N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iNDBJrVYwahf"
},
"source": [
"| cat_id | name | color |\n",
"| ------------- |---------------|-----------\n",
"| 0 | Apricot | orange |\n",
"| 1 | Boots | black |\n",
"| 2 | Cally | calico |\n",
"| 3 | NULL | white |\n",
"| 4 | Eugene | NULL |"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Sgs8fiCDwahg"
},
"source": [
"````{note}\n",
"Notice that the final output contains the entries with `cat_id` 3 and 4. If a row does not have a match, it is still included in the final output and any missing values are filled in with `NULL`.\n",
"\n",
"In `pandas`:\n",
"\n",
"```\n",
"pd.merge(names, colors, how='outer', on='cat_id')\n",
"```\n",
"````"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "u24A8PiXwahV"
},
"source": [
"### Left Join"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KxucuJmRwahW"
},
"source": [
"Definition: In a left join, all records from the **left table** are included in the joined table. If a row doesn't have a match in the right table, the missing values are filled in with `NULL`.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "U6nh8mm3wahX"
},
"source": [
"Example: As before, we join the `names` and `colors` tables together to match each cat with its color. This time, we want to keep all the cat names even if a cat doesn't have a matching color.\n",
"\n",
"SQL: To write an left join in SQL we modify our `FROM` clause to use the following syntax:\n",
"\n",
"```sql\n",
"SELECT ...\n",
"FROM \n",
" LEFT JOIN \n",
" ON <...>\n",
"```\n",
"\n",
"For example:\n",
"\n",
"```sql\n",
"SELECT name, color\n",
"FROM names N\n",
" LEFT JOIN colors C\n",
" ON N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iNDBJrVYwahf"
},
"source": [
"| cat_id | name | color |\n",
"| ------------- |---------------|-----------\n",
"| 0 | Apricot | orange |\n",
"| 1 | Boots | black |\n",
"| 2 | Cally | calico |\n",
"| 4 | Eugene | NULL |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"````{note}\n",
"Notice that the final output includes all four cat names. Three of the `cat_id`s in the `names` relation had matching `cat_id`s in the `colors` table and one did not (Eugene). The cat name that did not have a matching color has `NULL` as its color.\n",
"\n",
"In `pandas`:\n",
"\n",
"```\n",
"pd.merge(names, colors, how='left', on='cat_id')\n",
"```\n",
"````"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "u24A8PiXwahV"
},
"source": [
"### Right Join"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KxucuJmRwahW"
},
"source": [
"Definition: In a right join, all records from the **right table** are included in the joined table. If a row doesn't have a match in the left table, the missing values are filled in with `NULL`.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "U6nh8mm3wahX"
},
"source": [
"Example: As before, we join the `names` and `colors` tables together to match each cat with its color. This time, we want to keep all the cat color even if a cat doesn't have a matching name.\n",
"\n",
"SQL: To write a right join in SQL we modify our `FROM` clause to use the following syntax:\n",
"\n",
"```sql\n",
"SELECT ...\n",
"FROM \n",
" RIGHT JOIN \n",
" ON <...>\n",
"```\n",
"\n",
"For example:\n",
"\n",
"```sql\n",
"SELECT name, color\n",
"FROM names N\n",
" RIGHT JOIN colors C\n",
" ON N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iNDBJrVYwahf"
},
"source": [
"| cat_id | name | color |\n",
"| ------------- |---------------|-----------\n",
"| 0 | Apricot | orange |\n",
"| 1 | Boots | black |\n",
"| 2 | Cally | calico |\n",
"| 3 | NULL | white |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time, observe that the final output includes all four cat colors. Three of the `cat_id`s in the `colors` relation had matching `cat_id`s in the `names` table and one did not (white). The cat color that did not have a matching name has `NULL` as its name.\n",
"\n",
"You may also notice that a right join produces the same result a left join with the table order swapped. That is, `names` left joined with `colors` is the same as `colors` right joined with `names`. Because of this, some SQL engines (such as SQLite) do not support right joins.\n",
"\n",
"In `pandas`:\n",
"\n",
"```\n",
"pd.merge(names, colors, how='right', on='cat_id')\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dBfgXDaH7cqG"
},
"source": [
"### Implicit Inner Joins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are typically multiple ways to accomplish the same task in SQL just as there are multiple ways to accomplish the same task in Python. We point out one other method for writing an inner join that appears in practice called an *implicit join*. Recall that we previously wrote the following to conduct an inner join:"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "nDUi3rOtwahM"
},
"source": [
"```sql\n",
"SELECT *\n",
"FROM names AS N\n",
" INNER JOIN colors AS C\n",
" ON N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An implicit inner join has a slightly different syntax. Notice in particular that the `FROM` clause uses a comma to select from two tables and that the query includes a `WHERE` clause to specify the join condition."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "nDUi3rOtwahM"
},
"source": [
"```sql\n",
"SELECT *\n",
"FROM names AS N, colors AS C\n",
"WHERE N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When multiple tables are specified in the `FROM` clause, SQL creates a table containing every combination of rows from each table. For example:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 541
},
"colab_type": "code",
"executionInfo": {
"elapsed": 220,
"status": "ok",
"timestamp": 1523507999500,
"user": {
"displayName": "Ashley Chien",
"photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
"userId": "103846389297761369572"
},
"user_tz": 420
},
"id": "c7jRg1XIA1BE",
"outputId": "14a1d990-d5d2-42f8-8ea1-a9c19b1f25af"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cat_id | \n",
" name | \n",
" cat_id | \n",
" color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" Apricot | \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" Apricot | \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" Apricot | \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" Apricot | \n",
" 3 | \n",
" white | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Boots | \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 5 | \n",
" 1 | \n",
" Boots | \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 6 | \n",
" 1 | \n",
" Boots | \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
" 7 | \n",
" 1 | \n",
" Boots | \n",
" 3 | \n",
" white | \n",
"
\n",
" \n",
" 8 | \n",
" 2 | \n",
" Cally | \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 9 | \n",
" 2 | \n",
" Cally | \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 10 | \n",
" 2 | \n",
" Cally | \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
" 11 | \n",
" 2 | \n",
" Cally | \n",
" 3 | \n",
" white | \n",
"
\n",
" \n",
" 12 | \n",
" 4 | \n",
" Eugene | \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 13 | \n",
" 4 | \n",
" Eugene | \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 14 | \n",
" 4 | \n",
" Eugene | \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
" 15 | \n",
" 4 | \n",
" Eugene | \n",
" 3 | \n",
" white | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cat_id name cat_id color\n",
"0 0 Apricot 0 orange\n",
"1 0 Apricot 1 black\n",
"2 0 Apricot 2 calico\n",
"3 0 Apricot 3 white\n",
"4 1 Boots 0 orange\n",
"5 1 Boots 1 black\n",
"6 1 Boots 2 calico\n",
"7 1 Boots 3 white\n",
"8 2 Cally 0 orange\n",
"9 2 Cally 1 black\n",
"10 2 Cally 2 calico\n",
"11 2 Cally 3 white\n",
"12 4 Eugene 0 orange\n",
"13 4 Eugene 1 black\n",
"14 4 Eugene 2 calico\n",
"15 4 Eugene 3 white"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_expr = \"\"\"\n",
"SELECT *\n",
"FROM names N, colors C\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "HyMDqz-GBFCT"
},
"source": [
"This operation is often called a *Cartesian product*: each row in the first table is paired with every row in the second table. Notice that many rows contain cat colors that are not matched properly with their names. The additional `WHERE` clause in the implicit join filters out rows that do not have matching `cat_id` values."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "nDUi3rOtwahM"
},
"source": [
"```sql\n",
"SELECT *\n",
"FROM names AS N, colors AS C\n",
"WHERE N.cat_id = C.cat_id;\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "nDUi3rOtwahM"
},
"source": [
"\n",
" \n",
" \n",
" | \n",
" cat_id | \n",
" name | \n",
" cat_id | \n",
" color | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" Apricot | \n",
" 0 | \n",
" orange | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Boots | \n",
" 1 | \n",
" black | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" Cally | \n",
" 2 | \n",
" calico | \n",
"
\n",
" \n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "agSjoQakCYiR"
},
"source": [
"## Joining Multiple Tables"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "R3wO34xvDRgk"
},
"source": [
"To join multiple tables, extend the `FROM` clause with additional `JOIN` operators. For example, the following table `ages` includes data about each cat's age.\n",
"\n",
"| cat_id | age | \n",
"| ------------- |---------|\n",
"| 0 | 4 | \n",
"| 1 | 3 | \n",
"| 2 | 9 | \n",
"| 4 | 20 | "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To conduct an inner join on the `names`, `colors`, and `ages` table, we write:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {
"autoexec": {
"startup": false,
"wait_interval": 0
},
"base_uri": "https://localhost:8080/",
"height": 141
},
"colab_type": "code",
"executionInfo": {
"elapsed": 352,
"status": "ok",
"timestamp": 1523507981867,
"user": {
"displayName": "Ashley Chien",
"photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
"userId": "103846389297761369572"
},
"user_tz": 420
},
"id": "NRxZUi7XDXwX",
"outputId": "941d7596-5b75-46c4-f511-6187b9c52ced"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" color | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Apricot | \n",
" orange | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" Boots | \n",
" black | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" Cally | \n",
" calico | \n",
" 9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name color age\n",
"0 Apricot orange 4\n",
"1 Boots black 3\n",
"2 Cally calico 9"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Joining three tables\n",
"\n",
"sql_expr = \"\"\"\n",
"SELECT name, color, age\n",
" FROM names n\n",
" INNER JOIN colors c ON n.cat_id = c.cat_id\n",
" INNER JOIN ages a ON n.cat_id = a.cat_id;\n",
"\"\"\"\n",
"pd.read_sql(sql_expr, sqlite_engine)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"
\n",
"\n",
"We have introduced SQL syntax and the most important SQL statements needed to conduct data analysis using a relational database management system.\n",
"\n",
"We have covered the four main types of SQL joins: inner, full, left, and right joins. We use all four joins to combine information in separate relations, and each join differs only in how it handles non-matching rows in the input tables."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}